feat: capture-aware alloc + workspace pre-alloc (T2.2, T2.3)#97
Merged
Conversation
…sync

When CaptureAwareAllocator is active (set by BeginCapture/WithCapture), allocWeight routes through cudaMallocAsync on the capture stream so allocations are recorded as graph nodes. This avoids the silent hang caused by cudaMallocManaged during CUDA graph capture on GB10. Similarly, uploadBytes routes through cudaMemcpyAsync on the capture stream instead of the synchronous CPU copy used by the managed-memory path, which is illegal during capture. The ensureNotCapturing guard now fires only when capture is active but the allocator was NOT properly switched via BeginCapture/WithCapture.

Changes:
- Add IsCapturing() to the CaptureAwareAllocator interface
- Implement IsCapturing() on cuda.MemPool and gpuapi.CUDAMemPool
- Add async allocation/copy routing in allocWeight and uploadBytes
- Add function-variable indirections for MallocManaged, MallocAsync, and MemcpyAsync to enable CPU-mock testing
- Add 7 unit tests covering all routing paths
…o avoid capture-time alloc

Add preAllocateWorkspaces(), which eagerly initializes the FP8 scratchpad (scaleOne pointer + struct) and the cuBLASLt handle at the end of UploadWeights, before any CUDA graph capture region begins. These two objects previously used lazy initialization (getFP8Scratch, getLtHandle), which triggered cudaMalloc on first use -- hanging silently on GB10 when first use happened inside capture.

Also add a captureAllocCount atomic counter to track allocWeight attempts during active capture. EndCapture resets the counter and logs a warning if it is non-zero. CaptureAllocCount() exposes the counter for testing.
Summary
Wave 4b of the GB10 CUDA graph capture fix (docs/plan.md E2). This is the core fix that resolves the silent hang described in #93.
- **`allocWeight` routing.** When `CaptureAwareAllocator.IsCapturing()`, `allocWeight` routes through `cudaMallocAsync` on the capture stream (recorded as a graph node) instead of `MallocManaged` (illegal during capture on GB10). Similarly, `uploadBytes` routes through `cudaMemcpyAsync` H2D during capture. Added `IsCapturing()` to the `CaptureAwareAllocator` interface plus implementations. 7 new tests.
- **Workspace pre-allocation.** `preAllocateWorkspaces()`, called at the end of `UploadWeights`, eagerly initializes the FP8 scratchpad and the cuBLASLt handle so no lazy alloc occurs inside capture. Added a `captureAllocCount` atomic counter that instruments capture-time allocs, which should be zero for a properly pre-allocated workload. 7 new tests.

Together with T2.1a (`WithCapture` helper, PR #96) and T4.1 (capture watchdog, PR #96), this completes the E2+E4 fix path. The production hang in #93 is now resolved: callers use `WithCapture` → allocator switches to capture-aware mode → `allocWeight` uses async alloc → no illegal `MallocManaged` during capture → no hang.

Refs #93.
Verification
- `go build ./...`: PASS
- `go test ./compute/... -race -timeout 120s`: PASS (14 new tests, 2.7s)

Test plan

- `go build ./...`
- `go test ./compute/... -race -timeout 120s`